Automated Audio Captioning with Recurrent Neural Networks
We present the first approach to automated audio captioning. We employ an
encoder-decoder scheme with an alignment model in between. The input to the
encoder is a sequence of log mel-band energies calculated from an audio file,
while the output is a sequence of words, i.e. a caption. The encoder is a
multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a
multi-layered GRU with a classification layer connected to the last GRU of the
decoder. The classification layer and the alignment model are fully connected
layers with shared weights between timesteps. The proposed method is evaluated
using data drawn from a commercial sound effects library, ProSound Effects. The
resulting captions were rated with metrics used in the machine translation and
image captioning fields. Results from these metrics show that the proposed method
can predict words appearing in the original caption, although not always in the
correct order.
Comment: Presented at the 11th IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics (WASPAA), 2017
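Below is a minimal PyTorch sketch of the encoder-decoder scheme described in this abstract: a multi-layered bi-directional GRU encoder over log mel-band energies, a fully connected alignment model with weights shared across timesteps, and a GRU decoder topped by a word-classification layer. The layer counts, sizes, vocabulary, and the simplified single-context decoding are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of an audio-captioning encoder-decoder with an alignment model.
# All sizes and the simplified decoding loop are assumptions for illustration.
import torch
import torch.nn as nn


class AudioCaptioner(nn.Module):
    def __init__(self, n_mels=64, enc_hidden=256, dec_hidden=256, vocab_size=5000):
        super().__init__()
        # Encoder: multi-layered bi-directional GRU over log mel-band energies.
        self.encoder = nn.GRU(n_mels, enc_hidden, num_layers=3,
                              bidirectional=True, batch_first=True)
        # Alignment model: a fully connected layer shared across timesteps,
        # mapping encoder states to scalar attention scores.
        self.alignment = nn.Linear(2 * enc_hidden, 1)
        # Decoder: multi-layered GRU fed with the attended encoder summary.
        self.decoder = nn.GRU(2 * enc_hidden, dec_hidden, num_layers=2,
                              batch_first=True)
        # Classification layer on the last decoder GRU, shared over timesteps.
        self.classifier = nn.Linear(dec_hidden, vocab_size)

    def forward(self, mel, caption_len):
        # mel: (batch, frames, n_mels); caption_len: number of words to emit.
        enc_out, _ = self.encoder(mel)                  # (B, T, 2*H)
        scores = self.alignment(enc_out)                # (B, T, 1)
        weights = torch.softmax(scores, dim=1)          # attention over frames
        context = (weights * enc_out).sum(dim=1)        # (B, 2*H)
        # Simplification: feed the same context at every decoding step.
        dec_in = context.unsqueeze(1).repeat(1, caption_len, 1)
        dec_out, _ = self.decoder(dec_in)               # (B, L, dec_hidden)
        return self.classifier(dec_out)                 # (B, L, vocab_size) word logits


logits = AudioCaptioner()(torch.randn(2, 500, 64), caption_len=12)
```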
Sound Event Detection Using Spatial Features and Convolutional Recurrent Neural Network
This paper proposes to use low-level spatial features extracted from
multichannel audio for sound event detection. We extend the convolutional
recurrent neural network to handle more than one type of these multichannel
features by learning from each of them separately in the initial stages. We
show that, instead of concatenating the features of each channel into a single
feature vector, the network learns sound events in multichannel audio better
when the features are presented as separate layers of a volume. Using the proposed
spatial features over monaural features on the same network gives an absolute
F-score improvement of 6.1% on the publicly available TUT-SED 2016 dataset and
2.7% on the TUT-SED 2009 dataset, which is fifteen times larger.
Comment: Accepted for the IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP 2017)
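The input-layout difference stressed in this abstract (concatenating per-channel features into one feature vector versus presenting them as separate layers of a volume) can be sketched as follows. The feature sizes and the single convolutional layer are assumptions for illustration; the paper's full model is a convolutional recurrent network.

```python
# Illustrative contrast between the two multichannel input layouts.
# Two audio channels and 40 mel bands are assumed sizes, not the paper's values.
import torch
import torch.nn as nn

frames, mel_bands, audio_channels = 256, 40, 2
features = torch.randn(1, audio_channels, frames, mel_bands)  # per-channel features

# (a) Concatenated into a single feature vector per frame: one input plane.
concat_input = features.permute(0, 2, 1, 3).reshape(1, 1, frames,
                                                    audio_channels * mel_bands)
conv_concat = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)

# (b) Presented as separate layers of a volume: each audio channel becomes its
# own CNN input channel, so early layers can learn from each channel separately.
conv_volume = nn.Conv2d(in_channels=audio_channels, out_channels=32,
                        kernel_size=3, padding=1)

out_a = conv_concat(concat_input)  # (1, 32, frames, audio_channels * mel_bands)
out_b = conv_volume(features)      # (1, 32, frames, mel_bands)
```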
A Recurrent Encoder-Decoder Approach with Skip-filtering Connections for Monaural Singing Voice Separation
The objective of deep learning methods based on encoder-decoder architectures
for music source separation is to approximate either ideal time-frequency masks
or spectral representations of the target music source(s). The spectral
representations are then used to derive time-frequency masks. In this work we
introduce a method to directly learn time-frequency masks from an observed
mixture magnitude spectrum. We employ recurrent neural networks and train them
using, as prior knowledge, only the magnitude spectrum of the target source. To
assess the performance of the proposed method, we focus on the task of singing
voice separation. The results from an objective evaluation show that our
proposed method provides results comparable to deep-learning-based methods
that operate on complicated signal representations. Compared to previous
methods that approximate time-frequency masks, our method improves the
signal-to-distortion ratio by an average of 3.8 dB.
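A rough PyTorch sketch of the skip-filtering idea described above: the recurrent network outputs a time-frequency mask, a skip connection multiplies that mask with the observed mixture magnitude, and the loss is computed only against the target-source magnitude, so the mask is learned without ever being an explicit training target. The network sizes and the mean-squared-error loss are assumptions, not the paper's exact setup.

```python
# Hedged sketch of skip-filtering connections for mask learning.
# Sizes and loss are illustrative assumptions.
import torch
import torch.nn as nn


class SkipFilteringSeparator(nn.Module):
    def __init__(self, n_bins=1025, hidden=512):
        super().__init__()
        self.rnn = nn.GRU(n_bins, hidden, num_layers=2, batch_first=True,
                          bidirectional=True)
        self.mask_layer = nn.Linear(2 * hidden, n_bins)

    def forward(self, mixture_mag):
        # mixture_mag: (batch, frames, n_bins) magnitude spectrogram of the mixture.
        rnn_out, _ = self.rnn(mixture_mag)
        mask = torch.sigmoid(self.mask_layer(rnn_out))  # time-frequency mask in [0, 1]
        # Skip-filtering connection: apply the mask directly to the input mixture.
        return mask * mixture_mag


model = SkipFilteringSeparator()
mixture = torch.rand(4, 100, 1025)       # |mixture| magnitudes
voice_target = torch.rand(4, 100, 1025)  # |target source| magnitudes (training only)
estimate = model(mixture)
# Prior knowledge enters only through the target-source magnitudes in the loss.
loss = nn.functional.mse_loss(estimate, voice_target)
```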
Zero-Shot Audio Classification via Semantic Embeddings
In this paper, we study zero-shot learning in audio classification via
semantic embeddings extracted from textual labels and sentence descriptions of
sound classes. Our goal is to obtain a classifier that is capable of
recognizing audio instances of sound classes that have no available training
samples, but only semantic side information. We employ a bilinear compatibility
framework to learn an acoustic-semantic projection between intermediate-level
representations of audio instances and sound classes, i.e., acoustic embeddings
and semantic embeddings. We use VGGish to extract deep acoustic embeddings from
audio clips, and pre-trained language models (Word2Vec, GloVe, BERT) to
generate either label embeddings from textual labels or sentence embeddings
from sentence descriptions of sound classes. Audio classification is performed
by a linear compatibility function that measures how compatible an acoustic
embedding and a semantic embedding are. We evaluate the proposed method on the
small, balanced ESC-50 dataset and on a large-scale, unbalanced audio subset of
AudioSet. The experimental results show that classification performance is
significantly improved by including, during training, sound classes that are
semantically close to the test classes. We also demonstrate that both label
embeddings and sentence embeddings are useful for zero-shot learning.
Classification performance is improved by concatenating label/sentence
embeddings generated with different language models. With their hybrid
concatenations, the results are improved further.
Comment: Submitted to Transactions on Audio, Speech and Language Processing
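The bilinear compatibility function used above can be illustrated with a short sketch: a score F(x, y) = x^T W y between an acoustic embedding x and a class semantic embedding y, with zero-shot classification picking the unseen class whose semantic embedding is most compatible. The embedding dimensions and random tensors below are placeholders for actual VGGish clip embeddings and language-model label/sentence embeddings.

```python
# Hedged sketch of a bilinear acoustic-semantic compatibility function.
# Dimensions and inputs are placeholders, not the paper's exact values.
import torch
import torch.nn as nn

acoustic_dim, semantic_dim = 128, 300
W = nn.Parameter(torch.randn(acoustic_dim, semantic_dim) * 0.01)  # learned projection


def compatibility(x, Y):
    # x: (batch, acoustic_dim) acoustic embeddings of audio clips
    # Y: (n_classes, semantic_dim) semantic embeddings of candidate classes
    return x @ W @ Y.t()  # (batch, n_classes) compatibility scores


# Zero-shot inference: rank semantic embeddings of classes unseen in training.
x = torch.randn(8, acoustic_dim)                          # stand-in for VGGish embeddings
unseen_class_embeddings = torch.randn(10, semantic_dim)   # stand-in for label embeddings
scores = compatibility(x, unseen_class_embeddings)
predicted_class = scores.argmax(dim=1)                    # most compatible unseen class per clip
```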